In professional sports, there is enormous interest in leveraging historical statistics both to predict future outcomes (wins and losses) and to find opportunities for tuning or improving a team's or individual's performance. This data-driven approach to sports has gained a large following over the last decade and has entered mass media in the form of fantasy leagues, movies (e.g. Moneyball), and websites/podcasts (e.g. FiveThirtyEight). In this analysis, we use a classic baseball data set with the goal of building several different models capable of predicting a team's wins over a season from the team's other statistics for that season (home runs, strikeouts, base hits, etc.). The data set contains approximately 2,200 records of professional baseball team statistics from 1871 to 2006 inclusive. Each record holds a team's performance for a given year, with all statistics adjusted to a 162-game season.
We will first explore the data looking for issues or challenges (missing data, outliers, possible coding errors, multicollinearity, etc.). Once we have a handle on the data, we will apply any necessary cleaning steps. With a reasonable data set in hand, we will build and evaluate three different linear models that predict seasonal wins. Our data includes both a training set and an evaluation set: we will train on the training data, then judge the models by how well they perform against the holdout evaluation data. Finally, we will select a final model that offers the best compromise between accuracy and simplicity.
Below we describe the size of the moneyball training data set and its variables, keeping to the level of detail a manager would want.
Variables of Interest
Let's look at the first few rows to get a sense of the data.
| INDEX | TARGET_WINS | BATTING_H | BATTING_2B | BATTING_3B | BATTING_HR | BATTING_BB | BATTING_SO | BASERUN_SB | BASERUN_CS | BATTING_HBP | PITCHING_H | PITCHING_HR | PITCHING_BB | PITCHING_SO | FIELDING_E | FIELDING_DP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 39 | 1445 | 194 | 39 | 13 | 143 | 842 | NA | NA | NA | 9364 | 84 | 927 | 5456 | 1011 | NA |
| 2 | 70 | 1339 | 219 | 22 | 190 | 685 | 1075 | 37 | 28 | NA | 1347 | 191 | 689 | 1082 | 193 | 155 |
| 3 | 86 | 1377 | 232 | 35 | 137 | 602 | 917 | 46 | 27 | NA | 1377 | 137 | 602 | 917 | 175 | 153 |
| 4 | 70 | 1387 | 209 | 38 | 96 | 451 | 922 | 43 | 30 | NA | 1396 | 97 | 454 | 928 | 164 | 156 |
| 5 | 82 | 1297 | 186 | 27 | 102 | 472 | 920 | 49 | 39 | NA | 1297 | 102 | 472 | 920 | 138 | 168 |
Note that we will drop the INDEX column, as it offers no information for predictive models.
## [1] 2276 16
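Dropping the column is a one-line operation. A minimal self-contained sketch (in the real analysis this is applied to the full training CSV, whose file name is not shown here; the tiny inline data frame below is only for illustration):

```r
# Sketch: drop the INDEX column since it carries no predictive signal.
# The three-row data frame is a stand-in for the real 2276-row data set.
train_df <- data.frame(
  INDEX       = c(1, 2, 3),
  TARGET_WINS = c(39, 70, 86),
  BATTING_H   = c(1445, 1339, 1377)
)
train_df$INDEX <- NULL  # remove the identifier column in place
dim(train_df)           # rows and remaining columns
```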
## TARGET_WINS BATTING_H BATTING_2B BATTING_3B
## Min. : 0.00 Min. : 891 Min. : 69.0 Min. : 0.00
## 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0 1st Qu.: 34.00
## Median : 82.00 Median :1454 Median :238.0 Median : 47.00
## Mean : 80.79 Mean :1469 Mean :241.2 Mean : 55.25
## 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0 3rd Qu.: 72.00
## Max. :146.00 Max. :2554 Max. :458.0 Max. :223.00
##
## BATTING_HR BATTING_BB BATTING_SO BASERUN_SB
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0 1st Qu.: 66.0
## Median :102.00 Median :512.0 Median : 750.0 Median :101.0
## Mean : 99.61 Mean :501.6 Mean : 735.6 Mean :124.8
## 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0 3rd Qu.:156.0
## Max. :264.00 Max. :878.0 Max. :1399.0 Max. :697.0
## NA's :102 NA's :131
## BASERUN_CS BATTING_HBP PITCHING_H PITCHING_HR
## Min. : 0.0 Min. :29.00 Min. : 1137 Min. : 0.0
## 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419 1st Qu.: 50.0
## Median : 49.0 Median :58.00 Median : 1518 Median :107.0
## Mean : 52.8 Mean :59.36 Mean : 1779 Mean :105.7
## 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682 3rd Qu.:150.0
## Max. :201.0 Max. :95.00 Max. :30132 Max. :343.0
## NA's :772 NA's :2085
## PITCHING_BB PITCHING_SO FIELDING_E FIELDING_DP
## Min. : 0.0 Min. : 0.0 Min. : 65.0 Min. : 52.0
## 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0 1st Qu.:131.0
## Median : 536.5 Median : 813.5 Median : 159.0 Median :149.0
## Mean : 553.0 Mean : 817.7 Mean : 246.5 Mean :146.4
## 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2 3rd Qu.:164.0
## Max. :3645.0 Max. :19278.0 Max. :1898.0 Max. :228.0
## NA's :102 NA's :286
Looking at both the histograms and boxplots, we notice that many of the features are NOT normally distributed (a concern for linear regressions), there are a significant number of outliers that we will have to consider, and we see a large number of zero values for some features. For zero values and outliers, we will want to qualitatively assess whether those make sense and whether we need to transform them in some way.
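The distribution checks described above can be sketched with base graphics (a simplified stand-in for the report's actual plotting code, which is not shown):

```r
# Sketch: histogram plus boxplot for each numeric feature, to spot
# skew, outliers, and zero-inflation before modeling.
plot_distributions <- function(df) {
  for (v in names(df)) {
    hist(df[[v]], main = paste("Distribution of", v), xlab = v)
    boxplot(df[[v]], main = paste("Boxplot of", v), horizontal = TRUE)
  }
  invisible(NULL)
}
```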
Here we plot each variable against the target (number of wins) to get a feel for which variables may be predictive. The plots indicate some clear relationships; for example, hitting more doubles or more home runs is associated with winning more games.
The plots also reveal significant issues with the data. First, many records contain missing values that will need to be either imputed or discarded. Second, it appears some missing data is encoded as 0, and there are nonsensical outliers.
There is a team with 0 wins in the dataset. This seems unlikely. Many of the hitting categories include teams at 0; it is unlikely that a team hit 0 home runs over the course of a season.
The pitching variables also include many 0's; for instance, multiple teams show 0 strikeouts by their pitchers over an entire season, which is extremely unlikely. The pitching data also includes nonsensical outliers, such as a team logging nearly 20,000 strikeouts: averaged over 162 games, that is roughly 120 strikeouts per game, which is impossible since a team can record at most 27 outs in a nine-inning game. Team pitching walks and team pitching hits show similarly strange outliers.
Lastly, the error variable fails a sanity check. Teams typically commit two or fewer errors per game, which would put a season total near 324 at most, far below the maximum of 1,898 seen in this variable.
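The zero-value concern can be quantified directly. A small sketch (the helper name `count_zeros` is ours, not from the report's code):

```r
# Sketch: count zero entries per column; large counts suggest zeros are
# stand-ins for missing data rather than true season totals.
count_zeros <- function(df) {
  sort(sapply(df, function(col) sum(col == 0, na.rm = TRUE)),
       decreasing = TRUE)
}
```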
When we initially viewed the first few rows of the raw data, we already noticed missing data. Let’s assess which fields have missing data.
## values ind
## 1 91.61 BATTING_HBP
## 2 33.92 BASERUN_CS
## 3 12.57 FIELDING_DP
## 4 5.76 BASERUN_SB
## 5 4.48 BATTING_SO
## 6 4.48 PITCHING_SO
## 7 0.00 TARGET_WINS
## 8 0.00 BATTING_H
## 9 0.00 BATTING_2B
## 10 0.00 BATTING_3B
## 11 0.00 BATTING_HR
## 12 0.00 BATTING_BB
## 13 0.00 PITCHING_H
## 14 0.00 PITCHING_HR
## 15 0.00 PITCHING_BB
## 16 0.00 FIELDING_E
Notice that ~91.6% of the rows are missing the BATTING_HBP field; we will simply drop this column from consideration. The columns BASERUN_CS (caught stealing) and BASERUN_SB (stolen bases) both have missing values. Stolen bases were not tracked officially until 1887, so some of the missing data may come from the 1871-1886 seasons. We will impute those values. A notable percentage of BATTING_SO (batter strikeouts) and PITCHING_SO (pitcher strikeouts) values are also missing, which seems highly unlikely, so we will impute those as well.
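A ranking like the table above can be produced along these lines (a sketch; `missing_pct` is a hypothetical helper, as the report's actual code is not shown):

```r
# Sketch: percent of missing values per column, sorted high to low,
# mirroring the two-column (values, ind) table printed above.
missing_pct <- function(df) {
  pct <- sort(round(colMeans(is.na(df)) * 100, 2), decreasing = TRUE)
  data.frame(values = unname(pct), ind = names(pct))
}
```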
## values ind
## 1 0.383313355 BATTING_H
## 2 0.285964582 BATTING_2B
## 3 0.225471523 BATTING_BB
## 4 0.186326615 PITCHING_HR
## 5 0.173551988 BATTING_HR
## 6 0.164738136 PITCHING_BB
## 7 0.139073830 BATTING_3B
## 8 0.121110012 BASERUN_SB
## 9 0.105034615 PITCHING_H
## 10 0.009835166 BASERUN_CS
## 11 -0.030050754 FIELDING_DP
## 12 -0.037712096 BATTING_SO
## 13 -0.083637577 FIELDING_E
## 14 -0.089519609 PITCHING_SO
When evaluating features for inclusion in models, we will want to choose those with stronger positive or negative correlations with the target. Features with correlations near zero will probably not provide meaningful information for explaining a team's wins.
When we start building models, we will also need to account for the correlations between features themselves and avoid including strongly correlated pairs.
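The correlation ranking above can be reproduced with a sketch like this (`cor_with_wins` is our name for the helper; pairwise-complete correlations sidestep the missing values noted earlier):

```r
# Sketch: correlation of each feature with TARGET_WINS, sorted so the
# strongest positive predictors come first.
cor_with_wins <- function(df) {
  feats <- setdiff(names(df), "TARGET_WINS")
  r <- sapply(feats, function(v)
    cor(df[[v]], df$TARGET_WINS, use = "pairwise.complete.obs"))
  sort(r, decreasing = TRUE)
}
```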
We removed the BATTING_HBP field as it was missing >90% of the data and the INDEX field as it offers no information for a model.
Missing values found in BASERUN_CS (caught stealing), FIELDING_DP (double plays), BASERUN_SB (stolen bases), BATTING_SO (batter strikeouts), and PITCHING_SO (pitcher strikeouts) were all replaced with the column medians; it is highly unlikely that teams recorded none of these over an entire season.
There are unreasonable outliers in PITCHING_SO (pitcher strikeouts), PITCHING_H (hits allowed), PITCHING_BB (walks), and FIELDING_E (fielding errors) that exceed what is reasonable or possible given standard game length. While individual games can run long (e.g. extra innings), season totals cannot plausibly reflect outliers in every game. Given this, we will replace any outliers with the median for the data set. The limits we set were: PITCHING_SO > 4,000 (about 25 strikeouts per game), PITCHING_H > 5,000 (about 30 hits allowed per game), PITCHING_BB > 2,000 (about 13 walks per game), and FIELDING_E > 480 (about 3 errors per game).
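The cleaning rules described above can be sketched as a single helper (`clean_column` is our name for it; the thresholds are the ones stated in the text):

```r
# Sketch: cap impossible outliers at the column median, then fill any
# remaining NAs with the median as well.
clean_column <- function(x, upper = Inf) {
  med <- median(x, na.rm = TRUE)
  x[!is.na(x) & x > upper] <- med  # replace outliers beyond the cap
  x[is.na(x)] <- med               # median-impute missing values
  x
}
# e.g. df$PITCHING_SO <- clean_column(df$PITCHING_SO, upper = 4000)
```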
Using the training data set, we build three different multiple linear regression models, using different variables (or the same variables with different transformations). Since variables were selected manually, we indicate for each model why a variable was included or excluded.
We also discuss whether the coefficients make sense. For example, a team that hits a lot of home runs would reasonably be expected to win more games; if that coefficient comes out negative (suggesting the team would lose more games), the counterintuitive result needs to be justified before the model is kept.
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = clean_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.246 -8.040 0.149 8.459 64.706
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.8791382 5.5609672 3.575 0.000358 ***
## BATTING_H 0.0435068 0.0037271 11.673 < 2e-16 ***
## BATTING_2B -0.0181980 0.0091059 -1.998 0.045785 *
## BATTING_3B 0.0887800 0.0166055 5.346 9.87e-08 ***
## BATTING_HR -0.0498772 0.0307896 -1.620 0.105385
## BATTING_BB 0.0668013 0.0051197 13.048 < 2e-16 ***
## BATTING_SO 0.0043819 0.0039536 1.108 0.267841
## BASERUN_SB 0.0199677 0.0041255 4.840 1.39e-06 ***
## BASERUN_CS 0.0054960 0.0154463 0.356 0.722013
## PITCHING_H 0.0031034 0.0009168 3.385 0.000723 ***
## PITCHING_HR 0.0875913 0.0272043 3.220 0.001301 **
## PITCHING_BB -0.0343166 0.0043823 -7.831 7.39e-15 ***
## PITCHING_SO -0.0059484 0.0028921 -2.057 0.039826 *
## FIELDING_E -0.0431197 0.0051761 -8.331 < 2e-16 ***
## FIELDING_DP -0.1453152 0.0132805 -10.942 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.01 on 2260 degrees of freedom
## Multiple R-squared: 0.3141, Adjusted R-squared: 0.3099
## F-statistic: 73.93 on 14 and 2260 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = TARGET_WINS ~ BATTING_H + BATTING_2B + BATTING_3B +
## BATTING_BB + BASERUN_SB + PITCHING_H + PITCHING_HR + PITCHING_BB +
## PITCHING_SO + FIELDING_E + FIELDING_DP, data = clean_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61.051 -7.965 0.164 8.359 65.549
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.7189057 4.9999664 4.344 1.46e-05 ***
## BATTING_H 0.0428227 0.0036457 11.746 < 2e-16 ***
## BATTING_2B -0.0172282 0.0089438 -1.926 0.054197 .
## BATTING_3B 0.0917609 0.0161863 5.669 1.62e-08 ***
## BATTING_BB 0.0643346 0.0041503 15.501 < 2e-16 ***
## BASERUN_SB 0.0209334 0.0039963 5.238 1.77e-07 ***
## PITCHING_H 0.0027985 0.0008467 3.305 0.000964 ***
## PITCHING_HR 0.0466418 0.0083051 5.616 2.19e-08 ***
## PITCHING_BB -0.0323388 0.0034282 -9.433 < 2e-16 ***
## PITCHING_SO -0.0031173 0.0016399 -1.901 0.057446 .
## FIELDING_E -0.0425346 0.0051451 -8.267 2.31e-16 ***
## FIELDING_DP -0.1466897 0.0132112 -11.103 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.01 on 2263 degrees of freedom
## Multiple R-squared: 0.3132, Adjusted R-squared: 0.3098
## F-statistic: 93.8 on 11 and 2263 DF, p-value: < 2.2e-16
We can use a simple decision tree to help visualize feature importance.
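A minimal sketch of such a tree, assuming the `rpart` package and the same cleaned data frame (`clean_df`) used for the regressions above; the shallow depth is our choice for readability:

```r
# Sketch: fit a shallow regression tree and read off its variable
# importance ranking (higher values = more important splits).
library(rpart)
fit_importance_tree <- function(df) {
  tree <- rpart(TARGET_WINS ~ ., data = df, method = "anova",
                control = rpart.control(maxdepth = 3))
  tree$variable.importance  # named numeric vector
}
```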
To evaluate the multiple linear regression models, we will use metrics such as Adjusted R2 and RMSE, examine (a) mean squared error, (b) R2, (c) the F-statistic, and (d) residual plots on the training data, discuss any multicollinearity issues and other relevant model output, and explain how inferences can be drawn from each model. Finally, we will make predictions using the evaluation data set.